ABSTRACT

General-purpose multiprocessors (in our case, Intel Ivy Bridge
and Intel Haswell) increasingly add GPU computing power to the
former multicore architectures. When used for embedded
applications (in our case, synthetic aperture radar) with
intensive signal processing requirements, they must constantly
compute convolution algorithms, such as the well-known Fast
Fourier Transform (FFT). Due to its "fractal" nature (the typical
butterfly shape, with larger FFTs defined as combinations of
smaller ones with auxiliary data array transpose functions), one
can hope to compute analytically the size of the largest FFT that
can be performed locally on an elementary GPU compute block. The
full application must then be organized around this building
block size. However, due to phenomena involved in the data
transfers between the various memory levels across CPUs and GPUs,
the optimality of such a scheme is only loosely predictable (as
communication time tends to dominate computation time). A mix of
(theoretical) analytic approach and (practical) runtime
validation is therefore needed.

As we shall illustrate, this occurs at both stages: first when
deciding on a given elementary FFT block size, then at the full
application level.
1. INTRODUCTION

The Fast Fourier Transform (FFT) algorithm is one of the top ten
algorithms of the 20th century and a basic building block of many
signal processing algorithms, including those used in defense
systems (military radars, for example). Such applications
generally need to compute a number of large FFTs iteratively.
There is an abundant literature on how variants of the FFT
algorithm (with different numbers of stages, radixes, etc.) may
be preferred depending on the performance features of distinct
computing architectures. But such choices are only indicative:
large FFT computations divided into several stages include many
data movements to reorder temporary outputs and, while
computation costs can usually be assessed accurately, those
memory transfers are often only loosely predictable.

The case of modern processors such as Intel Ivy Bridge and
Haswell, where the CPU is combined with GPU accelerator kernels,
adds to the complexity of this issue: each GPU computational unit
(hereafter called EU, execution unit) has a given amount of
register memory, which can be used to define the largest FFT
block size that can be computed in a fully local fashion. This
block could then be used as a new modular unit of granularity, so
that the full-size FFT, and then the whole application, is built
around such coarse-grain modular units. But the communication
costs may then be prohibitive in that version (besides being only
loosely predictable). The proposed approach therefore first
computes analytically a "best version" of the FFT algorithm with
respect to the match between the FFT computation and the GPU
processing and storage power, and then adjusts this ideal
solution through practical benchmarking of the data transfer and
communication efficiency.

2. BACKGROUND 
2.1 FFT Basics 

The FFT factorizes the discrete Fourier transform (DFT) to reduce
the number of computations from $O(N^2)$ to $O(N \log N)$. The
DFT of a signal of N complex samples $x_n$ is given by:

$$X_k = \sum_{n=0}^{N-1} x_n W_N^{nk}, \quad \text{with } W_N = e^{-2i\pi/N}$$

The $W_N$ coefficients, commonly known as the twiddle factors,
are generally precomputed and stored in memory for reuse. The
backward DFT is obtained by changing the sign of the $W_N$
exponent. The FFT algorithm runs multiple stages, each of which
is a set of multiply-add operations named radix or butterfly. For
instance, a $2^{12}$-samples FFT has 12 stages of $2^{11}$
radix-2 operations, or 6 stages of $2^{10}$ radix-4 operations,
or 4 stages of $2^{9}$ radix-8 operations. If the number of
samples does not allow the use of a single type of radix
operation (for instance, one cannot complete a 1024-samples FFT
with only radix-8 stages), then different radixes are used
(figure 1) and the implementation is called a mixed-radix FFT.
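As an illustration of the definitions above, the following Python sketch evaluates the DFT directly, computes the same result with a textbook recursive radix-2 FFT, and reports how many stages and butterflies a pure radix-r plan needs. The function names (`dft`, `fft_radix2`, `pure_radix_plan`) are ours and are not taken from the implementation described in this paper, which targets OpenCL.

```python
import cmath


def dft(x):
    """Direct O(N^2) evaluation of X_k = sum_n x_n * W_N^(n*k)."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft_radix2(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    half = n // 2
    out = [0j] * n
    for k in range(half):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle W_N^k
        out[k] = even[k] + t         # upper output of the butterfly
        out[k + half] = even[k] - t  # lower output of the butterfly
    return out


def pure_radix_plan(n, radix):
    """Return (stages, butterflies per stage), or None if n is not a
    power of the given radix (a mixed-radix plan is then required)."""
    stages, m = 0, n
    while m > 1 and m % radix == 0:
        m //= radix
        stages += 1
    return (stages, n // radix) if m == 1 else None


if __name__ == "__main__":
    signal = [complex(i % 3, -i) for i in range(16)]
    err = max(abs(a - b) for a, b in zip(dft(signal), fft_radix2(signal)))
    print("max |DFT - FFT| =", err)
    for radix in (2, 4, 8):
        print("radix", radix, "->", pure_radix_plan(4096, radix))
    print("1024 samples, radix 8 only ->", pure_radix_plan(1024, 8))
```

For 4096 samples the helper returns 12, 6 and 4 stages for radixes 2, 4 and 8, and it returns `None` for a 1024-samples FFT restricted to radix 8, which is the case where a mixed-radix plan is needed.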



Mixed-radix implementations are popular because lower radixes
generally offer poor performance. Most signal processing
applications which use the FFT operate on datasets whose number
of input samples is a power of two, usually not larger than
2^[exponent illegible in the source]. Thus we focused on the
mixed-radix implementation.

2.2 Radar application

With any military electronics application, balancing size,
weight and power (SWaP) presents the biggest challenges to
designers. This is especially true for radar systems used on
unmanned aerial vehicles (UAV). For radar applications, GPGPU is
now becoming an efficient ingredient to reduce the footprint of
the solution when customers want to deploy new radar systems
across the world.

Figure 2: Hardware choices for designing an embedded radar system

Exploiting the GPU integrated in the CPU to perform useful radar
computations reduces hardware cost and increases the energy
efficiency of the system.

SAR (synthetic aperture radar) is typically mounted on a moving
platform such as an aircraft or spacecraft, and it originated as
an advanced form of side-looking airborne radar. The distance the
SAR device travels over a target creates a large "synthetic"
antenna aperture (the "size" of the antenna).

In this study we focus on synthetic aperture radar, and
especially on the range Doppler algorithm (RDA) (figure 3).

Figure 3: Synthetic aperture radar range Doppler algorithm

2.3 Integrated Intel GPGPUs 

The analysis and experiments are conducted with integrated GPUs.
The presence of a CPU and a GPU on the same die opens significant
opportunities for parallel algorithms to be accelerated by the
GPU. An integrated GPU shares RAM with the CPU (figure 4). This
also means that some of the cache levels are shared between the
CPU and the GPU; in this specific case the L3 cache is shared.
The inner architecture of the SIMD multithreaded processor (EU,
execution unit) is not disclosed. It is not studied in detail,
but its effects are taken into account in the approach. The
number of execution units and their capabilities vary with GPU
implementations. For instance, the GPU integrated in Haswell has
40 EUs (HD Graphics 5200 GT3e) while the one integrated in the
Ivy Bridge CPU has 16 EUs (HD Graphics 4000) (figure 2). The
experiments are conducted with these two integrated CPU/GPU
pairs. The third and fourth generations of Intel CPUs (Ivy Bridge
and Haswell) implement the AVX SIMD unit, which can perform 8
single precision computations in parallel (256 bits). Maximizing
the usage of this unit is not yet efficiently automated by
state-of-the-art compilers; this task is still left to the
programmer, who can access the unit through intrinsics. Haswell
also introduces the fused multiply-add instruction and the AVX2
SIMD units.

Figure 4: A generic integrated GPU architecture

3. APPROACH 

We consider two different layers of optimization, which
correspond to two different architecture layers. We call these
two levels of the optimization process the top-level and the
bottom-level analysis. The bottom-level analysis attempts to
optimize the run of one 1024-samples FFT on one computing element
(EU). The top-level analysis attempts to optimize the run of a
batch of 1024 FFTs of 1024 samples each on the whole system,
which is a multicore CPU with an integrated multi-EU GPU.

In our approach the 1D FFT is tailored to fit the integrated GPU
architecture by means of OpenCL (Open Computing Language). The
OpenCL framework defines two possible ways to express parallel
computations. The first is the SIMT (single instruction, multiple
threads) programming style, which abstracts the SIMD (single
instruction, multiple data) units so that the programmer is freed
from vectorization issues; the second is the SIMD vector style.
Our experiments showed that the SIMT style is more appropriate
for benefiting from the scalability of GPUs with many cores. Open
standards are more trustworthy, and OpenCL is now a de facto
standard compared with Nvidia CUDA.


Figure 5: Explicit parallelism and its flexibility (axis: from
less flexible to more flexible)


The FFT computation was expressed as a sequence of six FMA (fused
multiply-add) operations; this transformation allows using the
full width of the SIMD units within each GPU core (also known as
EU).
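The text does not spell out which factorization is used to reach six multiply-adds. One well-known way to obtain a radix-2 butterfly in exactly six multiply-add operations is to precompute the ratio t = Im(w)/Re(w) along with each twiddle factor w; the sketch below assumes that scheme and is only an illustration of the idea. The name `butterfly_6fma` is ours, and plain Python expressions of the form `a + b * c` stand in for hardware FMA instructions.

```python
import cmath


def butterfly_6fma(x0, x1, w):
    """Radix-2 butterfly y0 = x0 + w*x1, y1 = x0 - w*x1.

    Uses six multiply-add shaped operations. Assumes w.real != 0; in a
    real kernel t would be precomputed with the twiddle table, and the
    case w.real == 0 would use the symmetric ratio Re(w)/Im(w).
    """
    t = w.imag / w.real                # precomputable, not counted
    s_re = x1.real - t * x1.imag       # multiply-add 1
    s_im = x1.imag + t * x1.real       # multiply-add 2
    y0_re = x0.real + w.real * s_re    # multiply-add 3
    y0_im = x0.imag + w.real * s_im    # multiply-add 4
    y1_re = x0.real - w.real * s_re    # multiply-add 5
    y1_im = x0.imag - w.real * s_im    # multiply-add 6
    return complex(y0_re, y0_im), complex(y1_re, y1_im)


if __name__ == "__main__":
    w = cmath.exp(-2j * cmath.pi * 3 / 16)
    x0, x1 = 1.5 - 2.0j, -0.25 + 0.75j
    y0, y1 = butterfly_6fma(x0, x1, w)
    print(abs(y0 - (x0 + w * x1)), abs(y1 - (x0 - w * x1)))
```

The identity behind it is w*x1 = Re(w) * ((x1_re - t*x1_im) + i (x1_im + t*x1_re)), so the complex multiplication and the two additions fold into six operations of the form a ± b*c.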

We have identified three key limiting factors that were used 
to design our high performance FFT algorithm: 

• Number of registers available on the integrated GPU 

• Size of shared memory 

• The on-chip interconnect 

We have fine-tuned our parallel FFT implementation to maximize
floating point throughput and to minimize the communication
overhead between threads.


3.1 Radar application description 

To ease understanding, we consider a simplified application model
(figure 6). This application is massively parallel: for one
complete execution the first and last tasks have to run 1024
times (while the transposition runs only once), and those runs
are independent. These many FFT occurrences are mapped on the
streaming multiprocessors (here, the EUs) and the CPU cores. One
of the outcomes of this study is to determine how to balance the
load between the CPU and the GPU.

Moreover, some parameters of the FFT block implementation can be
tuned to reach better performance; we therefore study this block
more precisely on the computing elements (EU and CPU core) which
will execute it.



Figure 6: The radar application and the inner FFT 
block description (8-samples radix2) 


3.2 Bottom level analysis 

3.2.1 Insights: 

Butterflies can be clustered into higher-order radixes. In order
to minimize the stress on the memory hierarchy, we consider that
the highest possible radix is the one whose input data fits in
the private memory of the processing elements. Some simple
experiments show that different clusterings cause different
performance gains or losses on the GPU. With
$\mathrm{FFT}_{\text{list of radixes}}$ denoting the performance
of the FFT implementation when that list of radixes is executed,
we get on Ivy Bridge:

[equation illegible in the source; only the term
$\mathrm{FFT}_{8,8,8,2}$ survives]

With the insight that $\mathrm{FFT}_{8,8,8,2}$ is the optimal
implementation, this shows that the naive implementation is much
slower than the assumed optimal implementation. We believe
however that there exists another list of radixes which performs
better than $\mathrm{FFT}_{8,8,8,2}$.

The number of different mixed-radix FFT implementations can be
very large, even for a small set of radixes and ordinary FFT
sizes. In our representation it corresponds to the number of
distinct paths between the source and the sink nodes. The number
of stage runs that we need to benchmark in this approach is
$3(N-1)$ with $N = \log_2(n_{samples})$. For instance, with a
4096-samples mixed-radix FFT using radixes 2, 4 and 8 running on
a 3.60 GHz Intel Xeon Pentium (according to the performance
figures from the FFTW benchmark page), testing the whole set of
combinations would last about 12 seconds, while testing only our
subset of benchmarks would last much less than one second (only
33 stage runs). We can thus determine a performance model which
helps us choose one particular FFT in order to get performance
gains.


FFT size    Number of mixed-radix    Number of experiment
            implementations          stage runs
16          7                        9
32          13                       12
64          24                       15
128         44                       18
256         81                       21
512         149                      24
1024        274                      27
2048        504                      30
4096        927                      33
8192        1705                     36
16384       3136                     39

Figure 7: Number of possible mixed-radix implementations vs FFT
size
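The two columns of this table can be reproduced with a short counting sketch. It assumes, as in the text, that only radixes 2, 4 and 8 are allowed, so that a plan is an ordered way of covering log2(n) elementary stages with steps of 1, 2 or 3 stages; the function names are ours. Under this assumption the recurrence gives 274 plans for 1024 samples.

```python
def log2_exact(n_samples):
    """Return log2(n_samples) for a power-of-two size."""
    stages = n_samples.bit_length() - 1
    if n_samples < 1 or 1 << stages != n_samples:
        raise ValueError("n_samples must be a power of two")
    return stages


def count_mixed_radix_plans(n_samples, spans=(1, 2, 3)):
    """Number of distinct source-to-sink paths in the stage graph.

    A span of 1, 2 or 3 stages corresponds to radix 2, 4 or 8.
    """
    stages = log2_exact(n_samples)
    paths = [0] * (stages + 1)
    paths[0] = 1  # one way to stand at the source node
    for s in range(1, stages + 1):
        paths[s] = sum(paths[s - span] for span in spans if s - span >= 0)
    return paths[stages]


def count_stage_benchmarks(n_samples, spans=(1, 2, 3)):
    """Number of (radix, starting stage) pairs that must be timed."""
    stages = log2_exact(n_samples)
    return sum(max(stages - span + 1, 0) for span in spans)


if __name__ == "__main__":
    for size in (16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384):
        print(size, count_mixed_radix_plans(size), count_stage_benchmarks(size))
```

The second function is where the 3(N-1) figure comes from: radix 2 can start at N positions, radix 4 at N-1 and radix 8 at N-2, which sums to 3N-3.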


We identified a bandwidth bottleneck as a limiting factor; it was
measured with a memory benchmark between the CPU and the
integrated GPU.

Our experiments showed that 5 GBytes/s is the maximum measured
bandwidth on the Intel Ivy Bridge (IVB); this also gives us a
hint about the maximum achievable FFT GFlops on this
architecture. Let B be the memory bandwidth (GB/s), N the FFT
size and $T_{max}$ the maximum throughput (GFlops):

$$T_{max} = \frac{5 N \log_2(N) \cdot B}{2 \cdot 4 \cdot N} \qquad (1)$$

The theoretical maximum performance (1) is 32 GFlops for a 1K
complex FFT.

This was also confirmed by running our previous CPU-GPU bandwidth
test on the fourth generation Intel GPU integrated in the Haswell
(HSW) CPU; the maximum bandwidth is 10 GBytes/s. This gives us an
upper bound on the GPU maximum throughput of 62 GFlops. All these
theoretical bounds were verified by our implementation of the FFT
(figure 10) on these two GPU architectures.
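Equation (1) can be evaluated with a few lines of Python. The sketch assumes the usual reading of the formula: 5 N log2(N) floating point operations per FFT, and 2 * 4 bytes per complex single precision sample, each sample being counted once. The function name is ours.

```python
import math


def fft_throughput_bound_gflops(n, bandwidth_gb_s, bytes_per_sample=2 * 4):
    """Bandwidth-limited FFT throughput bound of equation (1), in GFlops.

    n                -- FFT size (number of complex samples)
    bandwidth_gb_s   -- measured CPU-GPU bandwidth B, in GB/s
    bytes_per_sample -- 2 floats of 4 bytes for a complex sample
    """
    flops = 5 * n * math.log2(n)
    bytes_moved = bytes_per_sample * n
    return flops / bytes_moved * bandwidth_gb_s


if __name__ == "__main__":
    print("IVB, B = 5 GB/s :", fft_throughput_bound_gflops(1024, 5.0))
    print("HSW, B = 10 GB/s:", fft_throughput_bound_gflops(1024, 10.0))
```

Evaluated literally for N = 1024, the formula gives 31.25 GFlops at 5 GB/s and 62.5 GFlops at 10 GB/s, close to the 32 and 62 GFlops quoted in the text. Since N cancels out, the bound grows only with log2(N) and with the bandwidth.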


3.2.2 FFT performance model: 

One 1024-samples FFT execution is a succession of radix stages of
potentially different orders. Looking at the costs of the FFT
stages shows that there is no straightforward relationship
between the index of a stage and its cost. Under the hypothesis
that the performance of a radix stage depends only on the index
of the stage (and not on which stages have been run before or
will run after), we need to benchmark the possible radix runs and
find out which combination yields the best performance. The
hypothesis is admissible because two stages of the same FFT
cannot run in parallel for algorithmic reasons (a synchronization
barrier exists between two stages).

The combination of possible radix schedules can be described with
a digraph where every path from the input node to the output node
is a valid execution. Every edge is weighted with the benchmarked
cost, and the minimum-cost path from source to sink is the
optimal mixed-radix implementation. Figure 8 provides the state
space of the 32-samples FFT for radixes 2, 4 and 8. For clarity
each edge is annotated with the executed radix instead of its
cost, and the nodes are annotated with the current stage. The
number of possible mixed-radix implementations is the number of
distinct paths from the node indexed 0 to the node indexed 5. If
the edges are weighted with the cost of the executed radix, the
shortest of these paths is the optimal mixed-radix 32-samples FFT
implementation on Ivy Bridge.



Figure 8: Admissible mixed-radix space for a 2^5 FFT

In order to determine this path the user needs to bench¬ 
mark: 

• radix 2 starting at stage 0, 1, ... 9 

• radix 4 starting at stage 0, 1, ... 8 

• ... until the highest possible radix (radix 8 in our ex¬ 
periments) 

In the graph of figure 8 the edges are labeled with the selected
radixes; once they are weighted with the benchmarked costs, we
optimise our FFT code with a shortest path algorithm (in our
case, Dijkstra's algorithm). Our benchmarks provide precise
values which depend on the underlying hardware.
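A minimal sketch of this search is given below for the 32-samples case (5 elementary stages). The cost table `HYPOTHETICAL_COSTS` is invented for the example and does not reproduce any measurement from this paper; the function name and the data layout (`cost[span][stage]`, where a span of 1, 2 or 3 stages means radix 2, 4 or 8) are also our own choices.

```python
import heapq


def best_mixed_radix(n_stages, cost):
    """Dijkstra over the stage graph.

    Nodes are stage indices 0..n_stages. cost[span][s] is the
    benchmarked cost of a radix-2**span pass starting at stage s.
    Returns (list of radixes, total cost) of the cheapest plan.
    """
    inf = float("inf")
    dist = {0: 0.0}
    prev = {}
    heap = [(0.0, 0)]
    while heap:
        d, s = heapq.heappop(heap)
        if d > dist.get(s, inf):
            continue  # stale queue entry
        if s == n_stages:
            break     # the sink has its final distance
        for span, stage_costs in cost.items():
            nxt = s + span
            if nxt > n_stages or s >= len(stage_costs):
                continue  # this radix cannot start at stage s
            nd = d + stage_costs[s]
            if nd < dist.get(nxt, inf):
                dist[nxt] = nd
                prev[nxt] = (s, 2 ** span)
                heapq.heappush(heap, (nd, nxt))
    radixes = []
    s = n_stages
    while s != 0:  # walk back from the sink to the source
        s, radix = prev[s]
        radixes.append(radix)
    radixes.reverse()
    return radixes, dist[n_stages]


# Invented costs (arbitrary time units) for a 32-samples FFT.
HYPOTHETICAL_COSTS = {
    1: [10, 12, 12, 13, 15],  # radix 2 starting at stage 0..4
    2: [17, 19, 20, 22],      # radix 4 starting at stage 0..3
    3: [24, 26, 30],          # radix 8 starting at stage 0..2
}

if __name__ == "__main__":
    plan, total = best_mixed_radix(5, HYPOTHETICAL_COSTS)
    print("best plan:", plan, "cost:", total)
```

Because the stage graph is acyclic and small, a simple dynamic program over the stage index would give the same answer; Dijkstra is used here only to mirror the text.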





















3.2.3 CPU mixed-radix implementation: 

The FFT implementation on the CPU cores is straightforward
because the SIMD units clearly yield better performance than the
regular floating point units, and the size of the SIMD units is
fixed and known. The most efficient radix is thus the one which
fills the SIMD unit completely. The FFT implementation on the CPU
is therefore potentially not the same as the one on the GPU. We
implemented an optimized version of the FFT using Intel
intrinsics to benefit from the AVX2 units; our implementation
performs very close to the one provided by the Intel IPP library:
we get 21 GFlops and IPP provides 22 GFlops in the same
conditions.

3.3 Other FFT sizes 

The approach can be adapted to other FFT sizes. The largest FFT
basic block is fixed in our approach but, thanks to the recursive
nature of the FFT, larger FFTs can still be realized following
the same methodology, with the same outcomes. Because the private
memories of the streaming multiprocessors can hold only 1024
complex samples, and if no other computing resource can hold 4096
samples, stages 11 and 12 need to be split off (for instance with
a decomposition from the literature). The application dataflow
graph then needs to include this split and merge, which increases
the need for message passing. Applications which require smaller
FFTs can follow the same approach.



Figure 9: 4096-samples FFT described as a combination of
1024-samples FFTs
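The split-and-merge idea can be sketched with the standard decimation-in-time identity: the input is split into `parts` interleaved sub-sequences, each one is transformed by the fixed-size block FFT, and a final radix-`parts` pass with twiddle factors merges the results. This is a generic textbook decomposition and not necessarily the exact scheme used for figure 9; the function names are ours, and a small direct DFT stands in for the 1024-samples block so that the example runs quickly.

```python
import cmath


def dft(x):
    """Direct DFT, used here as the stand-in block transform."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft_from_blocks(x, block_fft, parts=4):
    """Compute an FFT of size len(x) from `parts` block FFTs.

    block_fft must transform a list of len(x) // parts samples.
    With parts=4 and 1024-samples blocks this yields a 4096-samples
    FFT whose last two radix-2 stages are merged into one radix-4 pass.
    """
    n = len(x)
    m = n // parts
    # Split: one block FFT per interleaved sub-sequence x[r], x[r+parts], ...
    subs = [block_fft(x[r::parts]) for r in range(parts)]
    out = [0j] * n
    for k in range(m):
        # Twiddle each block output by W_n^(r*k).
        tw = [subs[r][k] * cmath.exp(-2j * cmath.pi * r * k / n)
              for r in range(parts)]
        # Merge: small DFT of size `parts` across the blocks.
        for q in range(parts):
            out[k + m * q] = sum(
                tw[r] * cmath.exp(-2j * cmath.pi * r * q / parts)
                for r in range(parts)
            )
    return out


if __name__ == "__main__":
    signal = [complex(i * 0.5, (i * 7) % 5) for i in range(32)]
    merged = fft_from_blocks(signal, dft, parts=4)
    err = max(abs(a - b) for a, b in zip(merged, dft(signal)))
    print("max error vs direct DFT:", err)
```

The split step only reads strided input and the merge step only combines one output of each block at a time, which is why both can be expressed as extra nodes of the dataflow graph.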


3.4 Top level analysis 

We intentionally map no more than one FFT to one EU: the EU can
hold in its private memory the whole FFT sample set (and no more
than one). Once the cost of an FFT is determined, we assume it
does not depend on what the other processing elements of the
system are doing. This hypothesis is assumed correct because we
restricted the size of the input data such that it fits into the
private memories of the processing elements; the processing
elements therefore do not load the shared memory hierarchy much,
apart from the first stage (reading) and the last stage (writing)
of the FFT. The performance model for the study of the radar
application relies on a dataflow description, annotated with
sizes on edges and costs on nodes. We assume that the semantics
of a task (a node of the application graph of section 3.1) are:

• Read on all its inputs

• Compute

• Write on all its outputs

This behavior (Read/Compute/Write) allows us to better predict
how the memory hierarchy is loaded.

The experiments show that the performance can vary depending on
how the FFT blocks are clustered together on the CPU and on the
GPU. Thus there is an optimal distribution ratio which depends on
the number of FFTs that need to be executed. The transposition
block is not studied, but it is well known that the transpose of
a large matrix can be split into smaller blocks; given the size
of the basic block, the same analysis could thus be conducted.
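A minimal sketch of such a distribution ratio follows. It assumes that each device sustains a fixed throughput and that the best split is the one where CPU and GPU finish at the same time; the optional `cpu_share` parameter, which scales the CPU throughput when only part of the CPU may be used for signal processing, is our own way of modelling a target CPU load. The throughput values used in the example (40 GFlops for the CPU, 55 GFlops for the Haswell GPU) are the ones reported in the results section.

```python
def cpu_ratio(p_cpu, p_gpu, cpu_share=1.0):
    """Fraction of the FFT batch to run on the CPU.

    p_cpu, p_gpu -- sustained throughput of each device (same unit)
    cpu_share    -- fraction of the CPU available for FFT work (0..1]

    Both devices finish together when S_cpu / P_cpu == S_gpu / P_gpu
    with S_cpu + S_gpu == S, hence ratio = P_cpu / (P_cpu + P_gpu).
    """
    effective_cpu = p_cpu * cpu_share
    return effective_cpu / (effective_cpu + p_gpu)


def split_batch(total_ffts, p_cpu, p_gpu, cpu_share=1.0):
    """Return (number of FFTs on the CPU, number of FFTs on the GPU)."""
    on_cpu = round(total_ffts * cpu_ratio(p_cpu, p_gpu, cpu_share))
    return on_cpu, total_ffts - on_cpu


if __name__ == "__main__":
    print("full CPU  :", cpu_ratio(40, 55), split_batch(1024, 40, 55))
    print("50% CPU   :", cpu_ratio(40, 55, 0.5), split_batch(1024, 40, 55, 0.5))
```

With these inputs the ratio is about 42% when the CPU is fully available and about 27% when only half of it is, which matches the figures given in the results.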

4. RESULTS 

We compare in figure 10 the performance of our FFT on integrated
GPUs (red and blue) to the Intel IPP FFT on the CPU (green). The
memory bandwidth gives us a precious indication about our
optimization performance; the number of available registers gives
us a valuable hint about which algorithm to choose and what the
granularity of our FFT implementation should be (radix-8 in our
case); and finally the shared memory space defines the number of
threads that must be used to process an FFT of size N.

Figure 10: Performance scaling of the FFT computation (series:
HSW GPU, IVB GPU, IVB CPU (AVX))

We measured the energy consumption during the FFT computation.
The measurements in figure 11 show a low energy footprint for the
integrated GPU (4.6 W) compared to the CPU. We also noticed that
computing the FFT entirely on the integrated GPU frees the CPU
for other tasks, such as handling the data stream from the
sensors and executing the communications software stack.

Figure 11: Energy efficiency expressed in GFlops per Watt

Given the results of figure 12, we are able to determine that
$\mathrm{FFT}_{4,8,8,4}$ is the optimal mixed-radix 1024-samples
FFT implementation, which shows that the initially assumed
$\mathrm{FFT}_{8,8,8,2}$ was not the optimal one. The obtained
$\mathrm{FFT}_{4,8,8,4}$ has a 5% performance gain compared to
$\mathrm{FFT}_{8,8,8,2}$ and a 31% performance gain compared to
the naive $\mathrm{FFT}_{2,2,2,2,2,2,2,2,2,2}$ implementation:

$$\frac{\mathrm{FFT}_{8,8,8,2}}{\mathrm{FFT}_{4,8,8,4}} = 1.05 \qquad \frac{\mathrm{FFT}_{2,2,2,2,2,2,2,2,2,2}}{\mathrm{FFT}_{4,8,8,4}} = 1.31$$

stage    radix 2    radix 4    radix 8
0        1600       3100       4135
1        2430       3660       4830
2        2600       3600       5520
3        2658       4002       5988
4        2560       4213       6480
5        2790       3910       7320
6        2600       4632       7896
7        2889       4510       7887
8        3512       5030       X
9        3913       X          X

Figure 12: Intel Ivy Bridge GPU benchmarks subset

The same experiments are conducted on the Haswell GPU (figure
13). $\mathrm{FFT}_{4,8,8,4}$ has also been found to be the
optimal mixed-radix implementation, with the following
performance gains:

$$\frac{\mathrm{FFT}_{8,8,8,2}}{\mathrm{FFT}_{4,8,8,4}} \approx 1.08 \qquad \frac{\mathrm{FFT}_{2,2,2,2,2,2,2,2,2,2}}{\mathrm{FFT}_{4,8,8,4}} = 1.36$$

stage    radix 2    radix 4    radix 8
0        1514       2813       3927
1        2295       3463       4569
2        2460       3407       5223
3        2513       3785       5666
4        2427       3985       6129
5        2640       3695       6930
6        2460       4385       7472
7        2737       4266       7445
8        3321       4240       X
9        3705       X          X

Figure 13: Intel Haswell GPU benchmarks subset

The chosen CPU implementation is $\mathrm{FFT}_{8,8,8,2}$. We use
the SIMD units of the CPU. Yet higher radixes could provide
better performance, as the fftw-wisdom tool suggests. The
heterogeneous evaluation is conducted on Haswell.

It might be valuable to consider that the CPU can be busy with
other tasks, most typically handling TCP/IP communications. Thus
it must not be saturated with signal processing work. The ratio
can be adapted depending on the desired CPU load. Given:

• $S_{cpu}$ the number of FFTs which run on the CPU

• $S_{gpu}$ the number of FFTs which run on the GPU

• $S$ the 1024-FFT batch (the total amount of work to be
processed)

• $P_{cpu}$ the performance obtained on the CPU

• $P_{gpu}$ the performance obtained on the GPU

• $CPU_{ratio} \in [0,1]$ the normalized amount of FFTs which run
on the CPU

the optimal split satisfies:

$$\frac{S_{cpu}}{P_{cpu}} = \frac{S_{gpu}}{P_{gpu}}$$

and

$$S_{cpu} + S_{gpu} = S, \qquad CPU_{ratio} = \frac{S_{cpu}}{S}$$

Figure 14: The obtained performance for CPU and GPU FFT
implementations and the maximum performance CPU ratio (x-axis:
number of FFTs; series: IVB CPU ratio, HSW CPU ratio, HSW CPU
ratio with 50% CPU load, IVB CPU ratio with 50% CPU load)

For the 1024-FFT batch we obtain a performance of 40 GFlops for
the CPU and 55 GFlops for the Haswell GPU. Thus the optimal
$CPU_{ratio}$ is 40/(40+55), about 42%. In order to obtain a 50%
load on the CPU, the CPU ratio needs to be set to 27%, that is
20/(20+55).


5. FUTURE WORK

The present paper investigates how, under a strict specification
of GPU EU sizing, one can deduce a theoretically optimal FFT
modular brick size, and then build the spatial and temporal
organization of a full application using such an elementary
algorithmic component. Because data transfers and communications
behave less predictably, experimental validation may be needed to
quickly tune the whole design. Approaches using Worst-Case
Execution Time (WCET) for the data movement operations could
unify the latency models, but at the risk of suboptimality, while
the choice of which FFT variant to use may be very sensitive to
the data transfer pattern (either fewer but larger transfers, or
more frequent but smaller ones). In the future we would like to
consider more involved program shapes for surrounding
applications, and other types of parametric algorithms than the
FFT. The objective remains to consider further how theoretically
optimal designs, based on simplified timing assumptions, can be
finely re-tuned afterwards by practical experiments that confirm
or challenge the theoretical optimality, because of real
phenomena that are hard to predict at model level (such as the
relative imprecision of data transfer latency in a grey-box
setting of merged traffic).

5.1 Related work and optimized mapping 

We provide a brief overview of the related work on optimization
techniques that target FFTs on GPUs and CPUs. Van Loan provides
an overview of FFT algorithms and their variants. Frigo and
Johnson presented FFTW, an adaptive library for the efficient
computation of FFTs of real and complex data of arbitrary
dimensions and sizes on many architectures. It employs a
two-stage methodology to adapt to the microprocessor architecture
and memory hierarchy. At installation time, the code generator
automatically generates highly optimized small DFT code blocks
called codelets. At run time, the pre-generated codelets are
assembled into a plan to compute large FFT problems. The space of
compositions of factorizations and algorithms for a given FFT
size is explored to find the best execution plan. Spiral is a
generator for optimized FFT libraries on CPUs and FPGAs. UHFFT
tries to find the best execution schedule through a better
understanding of the correlation between schedules and their
performance on modern architectures. The hardware characteristics
of GPUs vary widely with each new generation: GPUs now offer
better memory hierarchy support, including larger local storage
and register files, wider memory buses, etc. Thus it is
non-trivial to optimize these algorithms for a range of distinct
GPUs.

The idea of computing FFTs with 6 FMA operations per butterfly
has been proposed in earlier work, as has the adjustment of this
kind of FFT variant to the number of local registers.

While we started this work independently from these sources, our
original contribution lies in the study of the resulting data
traffic between CPU, global memory and local registers, and of
its (only partial) ability to keep up with the full computation
bandwidth; this shortfall is compensated by the gain in energy
spent, a clear advantage for embedded computing.


6. CONCLUSION

The described approach shows that different modeling levels can
be used to achieve overall optimization of a signal processing
system. Integrated GPGPUs offer a new, powerful shared-memory
vector unit, which in turn opens up more mapping and scheduling
decisions. Exploring these choices exhaustively can be very time
consuming. In our approach we reduce this time-consuming task to
a set of simple operations which can help to decide on the
optimal solution (provided our hypotheses are verified). We
consider our findings a step further toward the best performance
scaling on modern parallel embedded systems, and also a real
opportunity for embedded system designers to build
energy-efficient systems.

Our FFT algorithm can be used in complex and critical
applications, and a significant performance improvement is
expected at no additional effort.

We believe that GPGPUs bringing a 100X speedup or more is a pure
myth, and that realistic expectations are in general below half
of the available computation power. We can clearly argue that 40%
of the peak performance can be reached with a relatively small
effort to adapt the algorithm. This is a valid expectation with
the existing architectures but, with the new unified
heterogeneous memory hUMA (heterogeneous Uniform Memory Access)
used in HSA (Heterogeneous System Architecture), we anticipate a
productivity increase for the programmer and a significant
performance gain.